Japanese Text Normalization with Encoder-Decoder Model
Authors
Abstract
Text normalization is the task of transforming lexical variants into their canonical forms. We model text normalization as a character-level sequence-to-sequence learning problem and present a neural encoder-decoder model for solving it. Training an encoder-decoder model generally requires many sentence pairs. However, parallel corpora that pair Japanese non-standard forms with their canonical forms are scarce. To address this issue, we propose a data augmentation method that increases the data size by converting existing resources into synthesized non-standard forms using handcrafted rules. We conducted an experiment demonstrating that the synthesized corpus helps train an encoder-decoder model stably and improves the performance of Japanese text normalization.
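The augmentation idea described in the abstract can be illustrated with a minimal sketch. The abstract does not list the handcrafted rules, so the three substitutions in `RULES` below, along with the function names `synthesize` and `augment` and the probability parameter `p`, are assumptions made up for illustration. The sketch rewrites canonical sentences into synthesized non-standard variants and emits character-level (source, target) pairs of the kind a sequence-to-sequence model could be trained on.

```python
import random

# Hypothetical handcrafted rules: (canonical substring, non-standard variant).
# Illustrative only; these are NOT the rules used in the paper.
RULES = [
    ("すごい", "すげー"),
    ("ない", "ねー"),
    ("です", "っす"),
]


def synthesize(sentence, rules, rng, p=0.5):
    """Apply each matching rule with probability p to build a non-standard variant."""
    noisy = sentence
    for canonical, variant in rules:
        if canonical in noisy and rng.random() < p:
            noisy = noisy.replace(canonical, variant)
    return noisy


def augment(corpus, rules, seed=0, p=0.5):
    """Return (non-standard, canonical) character-sequence pairs.

    Sentences that no rule changed are skipped, since they add no
    normalization signal in this simplified setup.
    """
    rng = random.Random(seed)
    pairs = []
    for sentence in corpus:
        noisy = synthesize(sentence, rules, rng, p)
        if noisy != sentence:
            # Character-level sequences: source is the synthesized form,
            # target is the original canonical sentence.
            pairs.append((list(noisy), list(sentence)))
    return pairs


if __name__ == "__main__":
    for source, target in augment(["これはすごいです"], RULES, p=1.0):
        print("".join(source), "->", "".join(target))
```

Setting `p` below 1.0 produces a mix of partially and fully rewritten sentences from the same canonical text, which is one simple way to vary the synthesized corpus.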
Similar resources
Improving Neural Text Normalization with Data Augmentation at Character- and Morphological Levels
In this study, we investigated the effectiveness of augmented data for encoder-decoder-based neural normalization models. Attention-based encoder-decoder models are highly effective at generating text in many natural languages. In general, a large amount of training data must be prepared to train an encoder-decoder model. Unlike machine translation, there is little training data for text normalizatio...
Structured-based Curriculum Learning for End-to-end English-Japanese Speech Translation
Sequence-to-sequence attention-based neural network architectures have been shown to provide a powerful model for machine translation and speech recognition. Recently, several works have attempted to extend these models to the end-to-end speech translation task. However, the usefulness of these models was only investigated on language pairs with similar syntax and word order (e.g., English-French...
Joint Source-Channel Coding with Neural Networks for Analog Data Compression and Storage
We provide an encoding and decoding strategy for efficient storage of analog data onto an array of Phase-Change Memory (PCM) devices. The PCM array is treated as an analog channel, with the stochastic relationship between write voltage and read resistance for each device determining its theoretical capacity. The encoder and decoder are implemented as neural networks with parameters that are tra...
Learning attention for historical text normalization by learning to pronounce
Automated processing of historical texts often relies on pre-normalization to modern word forms. Training encoder-decoder architectures to solve such problems typically requires a lot of training data, which is not available for the named task. We address this problem by using several novel encoder-decoder architectures, including a multi-task learning (MTL) architecture using a grapheme-to-pho...
Sentence-Level Grammatical Error Identification as Sequence-to-Sequence Correction
We demonstrate that an attention-based encoder-decoder model can be used for sentence-level grammatical error identification for the Automated Evaluation of Scientific Writing (AESW) Shared Task 2016. The attention-based encoder-decoder models can be used for the generation of corrections, in addition to error identification, which is of interest for certain end-user applications. We show that ...